fix(ds query): isolate temp table names #1321
Conversation
Reviewer's Guide

This PR isolates temporary table names by resetting the list on each cloned dataset and ensuring no duplicates, then performs cleanup on the cloned instance rather than the original.

Sequence diagram for updated query execution and cleanup:

```mermaid
sequenceDiagram
    participant "Dataset"
    participant "Cloned Dataset"
    participant "Metastore"
    participant "Warehouse"
    "Dataset"->>"Cloned Dataset": clone(new_table=True)
    "Cloned Dataset"->>"Cloned Dataset": apply_steps()
    "Cloned Dataset"->>"Metastore": cleanup_tables(temp_table_names)
    "Cloned Dataset"->>"Warehouse": cleanup_tables(temp_table_names)
```

Class diagram for Dataset temp_table_names handling:

```mermaid
classDiagram
    class Dataset {
        +List temp_table_names
        +clone(new_table=True)
        +cleanup()
        +exec()
    }
    Dataset : clone() resets temp_table_names to []
    Dataset : cleanup() asserts no duplicates in temp_table_names
```
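The reset-and-assert behavior in the diagrams can be sketched roughly as follows. This is a minimal illustration, not the actual datachain implementation: the `Dataset`, `clone`, and `cleanup` names follow the class diagram, but the method bodies are assumptions.

```python
from copy import copy


class Dataset:
    """Sketch of per-clone temp table isolation (illustrative, not the real class)."""

    def __init__(self):
        self.temp_table_names = []

    def clone(self, new_table=True):
        # Each clone starts with an empty list, so temp tables created while
        # executing the cloned query are tracked (and cleaned up) in isolation
        # from the original dataset's list.
        new = copy(self)
        new.temp_table_names = []
        return new

    def cleanup(self):
        # Guard against runaway growth: the list should never hold duplicates.
        assert len(self.temp_table_names) == len(set(self.temp_table_names))
        dropped = list(self.temp_table_names)
        self.temp_table_names = []
        return dropped


ds = Dataset()
ds.temp_table_names.append("tmp_a")
cloned = ds.clone(new_table=True)
cloned.temp_table_names.append("tmp_b")
print(cloned.cleanup())       # the clone only drops its own table: ['tmp_b']
print(ds.temp_table_names)    # the original's list is untouched: ['tmp_a']
```

The key point is that cleanup runs on the cloned instance, so the original dataset's temp tables are never dropped out from under it.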
Deploying datachain-documentation

Latest commit: f5a0558
Status: ✅ Deploy successful!
Preview URL: https://aab39580.datachain-documentation.pages.dev
Branch Preview URL: https://isolate-temp-table-names.datachain-documentation.pages.dev
Force-pushed from ee46cea to be6eeaf
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #1321   +/-  ##
=======================================
  Coverage   87.72%   87.72%
=======================================
  Files         160      160
  Lines       14995    14997     +2
  Branches     2156     2156
=======================================
+ Hits        13154    13156     +2
  Misses       1349     1349
  Partials      492      492
```

Flags with carried forward coverage won't be shown.
Fixes for the tests in a separate PR: #1322

yep, thanks @dreadatour ... I'll keep looking into this PR ... it is probably right for the current approach with temp tables, but I need to understand the whole temp table mechanics a bit better
Force-pushed from 33e80f8 to f91da0a
Force-pushed from f91da0a to f5a0558
```python
# This is needed to always use a new connection with all metastore and warehouse
# implementations, as errors may close or render unusable the existing
# connections.
assert len(self.temp_table_names) == len(set(self.temp_table_names))
```
C: trying to add this assert instead of tests (we should not be getting duplicates).
In tests we have an additional check that no tables are left behind.
cc @dreadatour not sure if it makes sense to add a specific, complicated test ... this should be working better
Updated: might also fix #722

When we run multiple joins within the same chain, due to a recursive line in SQLJoin, we might end up with a list of 8K+ items with a lot of duplicates. It means the query can run very long at the end.

Script to reproduce this. Mind we run show and save at the end, which essentially means we are also doubling the list.
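The doubling effect described above can be illustrated with a toy model. This is not the actual SQLJoin code; `join_temp_tables` is a hypothetical stand-in for the old behavior, where a join carried both sides' temp table lists forward into the result.

```python
def join_temp_tables(left, right, new_table):
    # Old behavior (sketch): the joined dataset keeps both inputs' temp table
    # names, duplicates included, plus the join's own new temp table.
    return left + right + [new_table]


tables = ["tmp_0"]
for i in range(1, 14):
    # Chaining joins over the same lineage carries both copies of the
    # history along, so the list roughly doubles on every join.
    tables = join_temp_tables(tables, tables, f"tmp_{i}")

print(len(tables))       # 16383 entries after 13 joins (2**14 - 1)
print(len(set(tables)))  # but only 14 distinct names
```

With growth like this, cleanup at the end iterates over thousands of mostly duplicate names, which is why resetting the list on clone and asserting uniqueness keeps the query from slowing down at the end.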
TODO:
Summary by Sourcery
Isolate temporary table names per dataset clone and enforce no duplicates to prevent runaway temp table list growth, and fix cleanup to target the cloned query instance.
Bug Fixes: